statistical approach
A Statistical Approach for Synthetic EEG Data Generation
Vos, Gideon, Ebrahimpour, Maryam, van Eijk, Liza, Sarnyai, Zoltan, Azghadi, Mostafa Rahimi
Electroencephalogram (EEG) data is crucial for diagnosing mental health conditions but is costly and time-consuming to collect at scale. Synthetic data generation offers a promising solution to augment datasets for machine learning applications. However, generating high-quality synthetic EEG that preserves emotional and mental health signals remains challenging. This study proposes a method combining correlation analysis and random sampling to generate realistic synthetic EEG data. We first analyze interdependencies between EEG frequency bands using correlation analysis. Guided by this structure, we generate synthetic samples via random sampling. Samples with high correlation to real data are retained and evaluated through distribution analysis and classification tasks. A Random Forest model trained to distinguish synthetic from real EEG performs at chance level, indicating high fidelity. The generated synthetic data closely match the statistical and structural properties of the original EEG, with similar correlation coefficients and no significant differences in PERMANOVA tests. This method provides a scalable, privacy-preserving approach for augmenting EEG datasets, enabling more efficient model training in mental health research.
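The sample-then-filter idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the band names, the toy band-power values, the per-band Gaussian sampling, and the rule of comparing each candidate against the mean real profile are all assumptions made for illustration.

```python
import random
import statistics

# Assumed band names; the abstract does not list the bands used.
BANDS = ["delta", "theta", "alpha", "beta", "gamma"]

# Made-up band-power rows standing in for real EEG recordings.
real = [
    [30.1, 18.2, 12.5, 8.1, 3.2],
    [28.7, 17.5, 13.1, 7.6, 2.9],
    [31.4, 19.0, 11.8, 8.4, 3.5],
    [29.5, 18.8, 12.9, 7.9, 3.1],
]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def generate_synthetic(real_rows, n_candidates, threshold, rng):
    """Draw candidates band by band, keep those correlated with the real profile."""
    columns = list(zip(*real_rows))
    profile = [statistics.fmean(c) for c in columns]
    band_stats = [(statistics.fmean(c), statistics.stdev(c)) for c in columns]
    kept = []
    for _ in range(n_candidates):
        # Assumption: each band is sampled from a Gaussian fitted to the real data.
        candidate = [rng.gauss(mu, sd) for mu, sd in band_stats]
        if pearson(candidate, profile) >= threshold:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    samples = generate_synthetic(real, 200, 0.9, random.Random(0))
    print(f"kept {len(samples)} of 200 candidates")
```

A fixed-seed `random.Random` instance is passed in so that a run can be reproduced; the threshold of 0.9 is an arbitrary choice for the sketch.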
Risk factor identification and classification of malnutrition among under-five children in Bangladesh: Machine learning and statistical approach
Mahmud, Tasfin, Wara, Tayab Uddin, Joy, Chironjeet Das
This study aims to understand the factors behind malnutrition in under-five children using the nationwide Multiple Indicator Cluster Survey (MICS-2019), and to classify malnutrition stages with four well-established machine learning algorithms: Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Multi-layer Perceptron (MLP) neural network. Accuracy, precision, recall, and F1 scores are obtained to evaluate the performance of each model. A Pearson correlation coefficient analysis is also performed to identify the significant factors related to a child's malnutrition. Of the 24,686 samples in the dataset, 21,858 were eligible for analysis. Satisfactory and insightful results were obtained in each case, and the RF and MLP performed particularly well. For RF, the accuracy was 98.55%, average precision 98.3%, recall 95.68%, and F1 score 97.13%. For MLP, the accuracy was 98.69%, average precision 97.62%, recall 90.96%, and F1 score 97.39%. From the Pearson coefficient analysis, all negative correlations are listed, and the most significant impacts are found for, in order, WAZ2 (weight-for-age Z score, WHO) (-0.828), WHZ2 (weight-for-height Z score, WHO) (-0.706), ZBMI (BMI Z score, WHO) (-0.656), BD3 (whether child is still being breastfed) (-0.59), HAZ2 (height-for-age Z score, WHO) (-0.452), CA1 (whether child had diarrhea in last 2 weeks) (-0.34), Windex5 (wealth index quantile) (-0.161), melevel (mother's education) (-0.132), and CA14/CA16/CA17 (whether child had illness with fever, cough, and breathing) (-0.04).
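For readers unfamiliar with the four evaluation metrics named in this abstract, the following minimal sketch shows how they are computed in the binary case. The labels are made up, and the function name `binary_metrics` is hypothetical. The study itself reports averaged values over several malnutrition classes, which this sketch does not reproduce.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up labels: 1 = malnourished, 0 = not malnourished.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
```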
Statistical tuning of artificial neural network
Mohamad, Mohamad Yamen AL, Bevrani, Hossein, Haydari, Ali Akbar
Neural networks are often regarded as "black boxes" due to their complex functions and numerous parameters, which poses significant challenges for interpretability. This study addresses these challenges by introducing methods to enhance the understanding of neural networks, focusing specifically on models with a single hidden layer. We establish a theoretical framework by demonstrating that the neural network estimator can be interpreted as a nonparametric regression model. Building on this foundation, we propose statistical tests to assess the significance of input neurons and introduce algorithms for dimensionality reduction, including clustering and principal component analysis (PCA), to simplify the network and improve its interpretability and accuracy. The key contributions of this study include the development of a bootstrapping technique for evaluating artificial neural network (ANN) performance, applying statistical tests and logistic regression to analyze hidden neurons, and assessing neuron efficiency. We also investigate the behavior of individual hidden neurons in relation to output neurons and apply these methodologies to the IDC and Iris datasets to validate their practical utility. This research advances the field of Explainable Artificial Intelligence by presenting robust statistical frameworks for interpreting neural networks, thereby facilitating a clearer understanding of the relationships between inputs, outputs, and individual network components.
A statistical approach to detect sensitive features in a group fairness setting
Pelegrina, Guilherme Dean, Couceiro, Miguel, Duarte, Leonardo Tomazeli
The use of machine learning models in decision support systems with high societal impact has raised concerns about unfair (disparate) results for different groups of people. When evaluating such unfair decisions, one generally relies on predefined groups that are determined by a set of features that are considered sensitive. However, such an approach is subjective and does not guarantee that these features are the only ones to be considered sensitive, nor that they entail unfair (disparate) outcomes. In this paper, we propose a preprocessing step to address the task of automatically recognizing sensitive features that does not require a trained model to verify unfair results. Our proposal is based on the Hilbert-Schmidt independence criterion, which measures the statistical dependence of variable distributions. We hypothesize that if the dependence between the label vector and a candidate is high for a sensitive feature, then the information provided by this feature will entail disparate performance measures between groups. Our empirical results support our hypothesis and show that several features considered sensitive in the literature do not necessarily entail disparate (unfair) results.
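The dependence measure named in this abstract can be sketched as follows. This is a minimal illustration of the standard biased empirical estimator of the Hilbert-Schmidt independence criterion, trace(KHLH) / (n - 1)^2, and not the authors' code. The Gaussian kernel, the bandwidth of 1.0, and the toy feature and label values are all assumptions.

```python
import math


def gaussian_kernel_matrix(values, sigma):
    """Pairwise Gaussian (RBF) kernel matrix for a list of scalars."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
        for a in values
    ]


def center(matrix):
    """Double-center a square matrix, i.e. compute H M H with H = I - (1/n) 1 1^T."""
    n = len(matrix)
    row_means = [sum(row) / n for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [matrix[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC between two equal-length lists of scalars."""
    n = len(x)
    k_centered = center(gaussian_kernel_matrix(x, sigma_x))
    l_matrix = gaussian_kernel_matrix(y, sigma_y)
    # trace(K H L H) equals the elementwise sum of (H K H) * L for symmetric L.
    total = sum(k_centered[i][j] * l_matrix[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


if __name__ == "__main__":
    # Made-up candidate feature values and label vectors.
    feature = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    dependent_label = list(feature)            # fully determined by the feature
    constant_label = [1.0] * len(feature)      # carries no information
    print("dependent:", hsic(feature, dependent_label))
    print("constant: ", hsic(feature, constant_label))
```

A larger value indicates stronger statistical dependence between the candidate feature and the label vector; a label that never varies yields a value of zero up to floating-point error.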
Not Cheating on the Turing Test: Towards Grounded Language Learning in Artificial Intelligence
Recent hype surrounding the increasing sophistication of language processing models has renewed optimism regarding machines achieving a human-like command of natural language. Research in the area of natural language understanding (NLU) in artificial intelligence claims to have been making great strides in this area; however, the lack of conceptual clarity and consistency in how 'understanding' is used in this and other disciplines makes it difficult to discern how close we actually are. In this interdisciplinary research thesis, I integrate insights from cognitive science/psychology, philosophy of mind, and cognitive linguistics, and evaluate them against a critical review of current approaches in NLU to explore the basic requirements--and remaining challenges--for developing artificially intelligent systems with human-like capacities for language use and comprehension.
Using Machine Learning in the Evolving Landscape of Real-World Data
According to the Food and Drug Administration (FDA), the term real-world data (RWD) refers to routinely collected data relating to patient health status and the delivery of healthcare services, and real-world evidence (RWE) is the clinical evidence regarding the usage and potential benefits or risks of a medical product derived from the analysis of RWD. Both RWD and RWE have increasingly attracted attention in the healthcare industry for years now, and rightly so, considering that the healthcare analytics market is expected to expand at a compound annual growth rate of 28.9% between now and 2026. There's no doubt that within this massive data trove, there exist countless insights that could streamline care delivery, help physicians diagnose disease faster, and improve treatment strategies, if only we could identify them. This data revolution we are experiencing in the healthcare industry necessitates the appropriate tools and approaches to work with higher dimensional data sources to truly harvest the insights buried in RWD. Machine learning, an area of artificial intelligence (AI) consisting of a collection of methodologies that focus on algorithmically learning efficient representations of data and extracting insights from data, offers promise and has consistently been gaining traction within the industry in the context of RWD.
MLSys 2021: Bridging the divide between machine learning and systems
Amazon distinguished scientist and conference general chair Alex Smola on what makes MLSys unique -- both thematically and culturally. The Conference on Machine Learning and Systems (MLSys), which starts next week, is only four years old, but Amazon scientists already have a rich history of involvement with it. Amazon Scholar Michael I. Jordan is on the steering committee; vice president and distinguished scientist Inderjit Dhillon is on the board and was general chair last year; and vice president and distinguished scientist Alex Smola, who is also on the steering committee, is this year's general chair. As the deep-learning revolution spread, MLSys was founded to bridge two communities that had much to offer each other but that were often working independently: machine learning researchers and system developers. Registration for the conference is still open, with the very low fees of $25 for students and $100 for academics and professionals. "If you look at the big machine learning conferences, they mostly focus on, 'Okay, here's a cool algorithm, and here are the amazing things that it can do. And by the way, it now recognizes cats even better than before,'" Smola says. "They're conferences where people mostly show an increase in capability. At the same time, there are systems conferences, and they mostly care about file systems, databases, high availability, fault tolerance, and all of that. "Now, why do you need something in-between? Well, because quite often in machine learning, approximate is good enough. You don't necessarily need such good guarantees from your systems.
Machine learning revolutionizes methods to quantify the terrestrial biosphere
Researchers from the University establish a new methodology to improve, from space and through machine learning, the observation and analysis of the terrestrial biosphere. This statistical approach will represent a significant advance in monitoring crops and carbon sinks, as well as in predicting floods and droughts. The work has been published in the journal Science Advances. The new machine learning methodology makes it possible to improve the precision in the prediction of key parameters, such as the leaf area index, the gross primary productivity and the fluorescence of the chlorophyll induced by the sun, among others. The field of applications is huge and will be of great use to improve the monitoring of crops and carbon sinks, detect changes and anomalies, droughts and floods.
Best Books to Expand Your NLP Knowledge
The abundance of knowledge and resources can at times be overwhelming, especially when it comes to newer technologies such as Natural Language Processing, popularly known as NLP. When educating yourself, you should always choose resources with a solid foundation and up-to-date books that offer a broad range of learning. Here is a list of top books that can help you expand your NLP knowledge. One of the most widely referenced and recommended NLP books, written by Stanford University professor Dan Jurafsky and University of Colorado professor James Martin, provides a deep-dive guide on the subject of language processing. It's intended to accompany undergraduate or advanced graduate courses in Natural Language Processing or Computational Linguistics. However, it's a must-read for anyone diving into the theory and application of language processing as they grow and strengthen their analytics capabilities.
Anomaly Detection in Univariate Time-series: A Survey on the State-of-the-Art
Braei, Mohammad, Wagner, Sebastian
Anomaly detection for time-series data has been an important research field for a long time. Seminal work on anomaly detection methods has focused on statistical approaches. In recent years an increasing number of machine learning algorithms have been developed to detect anomalies in time series. Subsequently, researchers tried to improve these techniques using (deep) neural networks. In light of the increasing number of anomaly detection methods, the body of research lacks a broad comparative evaluation of statistical, machine learning and deep learning methods. This paper studies 20 univariate anomaly detection methods from all three categories. The evaluation is conducted on publicly available datasets, which serve as benchmarks for time-series anomaly detection. By analyzing the accuracy of each method as well as the computation time of the algorithms, we provide a thorough insight into the performance of these anomaly detection approaches, alongside some general notion of which method is suited for a certain type of data.
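As a simple illustration of what a statistical approach to univariate anomaly detection can look like, the sketch below flags points that deviate strongly from a trailing window. It is a generic rolling z-score detector and is not claimed to be one of the 20 methods evaluated in the survey. The function name `rolling_zscore_anomalies`, the window length, the threshold, and the series values are all assumptions.

```python
import statistics


def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value is far from the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]          # trailing window, excludes point i
        mean = statistics.fmean(history)
        std = statistics.stdev(history)
        if std == 0:
            continue                             # flat window: z-score undefined
        z = abs(series[i] - mean) / std
        if z > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Made-up series with one obvious spike at index 7.
    readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 25.0, 10.0, 10.2]
    print(rolling_zscore_anomalies(readings))
```

Because the window excludes the point being scored, a spike does not dampen its own z-score, although it does inflate the spread used for the points that immediately follow it.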